California Housing Prices Statistical Analysis and Prediction¶

California Housing Prices Statistical Analysis and Prediction

  1. Introduction
  2. Objective
  3. Dataset
  4. Data Cleaning / Preparation
  5. Exploratory Data Analysis
    • Distribution Analysis
    • Relationship Analysis
    • Correlation Analysis
    • Box Plot Analysis
    • Pair Plot Analysis
    • Exploratory Data Analysis Summary
    • Exploratory Data Analysis Conclusion
  6. Model Selection:
    • Model Selection Criteria
    • Planned Models for Selection
    • Steps for Models Selection
  7. Model Analysis
    • Model Analysis Criteria
    • Sampling
    • Linear Regression (Median House Value using Median Income)
    • Linear Regression (Using Longitude and Latitude to predict house prices based on location)
    • Multiple Linear Regression (Using Median Income, Total Rooms, Population)
    • Ridge Regression (Using Median Income)
    • Lasso Regression (Using Median Income)
    • Polynomial Regression (Using Median Income)
    • Random Forest Regression (Using Median Income)
  8. Comparative Analysis, interpretation of models and visualization
  9. Hypothesis validation and testing
  10. Conclusion
  11. References

1. Introduction:¶


House prices in California depend on several socio-economic and geographical factors; hence, the prediction becomes multi-faceted and very complex. With the increasing demand for affordable housing, it is vital that policymakers, investors, and residents be abreast of the dynamics that drive property values. Using a dataset from the 1990 California census, in this project, we attempt to predict median house prices using machine learning based on influential variables: median income, location, and property characteristics.

2. Objective:¶


The objective of this project is to carry out a thorough statistical analysis and develop predictive models that can estimate median housing prices in California. Various methods of multiple regression analysis, such as linear regression, ridge, lasso, polynomial, and random forest regressions, will be considered in this study. It will try to find the most accurate model to predict house prices. Given this, the project tries to find the key drivers of California housing prices using feature importance and evaluate the performance of the model using Machine Learning in real estate valuation.

3. Dataset:¶


The California Housing Prices dataset from Kaggle shows a snapshot taken from the 1990 census. This dataset has mainly been used for regression tasks in machine learning regarding median house price prediction. Let's take a glimpse into the dataset:

Dataset Features: The dataset includes 20,640 entries with the following 10 columns:

  1. longitude: The geographical longitude of the district (float).
  2. latitude: The geographical latitude of the district (float).
  3. housing_median_age: The median age of the houses in the district (float).
  4. total_rooms: The total number of rooms in the district (float).
  5. total_bedrooms: The total number of bedrooms in the district (float, some missing values).
  6. population: The population of the district (float).
  7. households: The number of households in the district (float).
  8. median_income: The median income for households in the district, in tens of thousands of dollars (float).
  9. median_house_value: The median house value for households in the district, in US dollars (float, target variable).
  10. ocean_proximity: The categorical variable indicating proximity to the ocean, with values like NEAR BAY, INLAND, NEAR OCEAN, and ISLAND.

Package Requirements:¶

In [437]:
!pip install folium
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: folium in c:\users\swapn\appdata\roaming\python\python312\site-packages (0.17.0)
Requirement already satisfied: branca>=0.6.0 in c:\users\swapn\appdata\roaming\python\python312\site-packages (from folium) (0.8.0)
Requirement already satisfied: jinja2>=2.9 in c:\programdata\anaconda3\lib\site-packages (from folium) (3.1.4)
Requirement already satisfied: numpy in c:\programdata\anaconda3\lib\site-packages (from folium) (1.26.4)
Requirement already satisfied: requests in c:\programdata\anaconda3\lib\site-packages (from folium) (2.32.2)
Requirement already satisfied: xyzservices in c:\programdata\anaconda3\lib\site-packages (from folium) (2022.9.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\programdata\anaconda3\lib\site-packages (from jinja2>=2.9->folium) (2.1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\programdata\anaconda3\lib\site-packages (from requests->folium) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in c:\programdata\anaconda3\lib\site-packages (from requests->folium) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\programdata\anaconda3\lib\site-packages (from requests->folium) (2.2.2)
Requirement already satisfied: certifi>=2017.4.17 in c:\programdata\anaconda3\lib\site-packages (from requests->folium) (2024.8.30)

Imports:¶

In [438]:
import pandas as pd
import numpy as np
import seaborn as sns
import folium
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, roc_curve, auc
from scipy import stats
from folium.plugins import HeatMap
from mpl_toolkits.mplot3d import Axes3D

# Setting up inline plotting
%matplotlib inline

Data Import:

In [439]:
data = pd.read_csv('dataset/housing_original.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
In [440]:
data.head()
Out[440]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY

These statistics provide a foundational understanding of the dataset's structure, indicating the diverse housing and economic landscape across California.

  1. Longitude and Latitude: The dataset covers a geographic area with longitude ranging from -124.35 to -114.31 and latitude from 32.54 to 41.95, corresponding to California’s general geographic boundaries.
  2. Housing Median Age: The median housing age is around 28.6 years, with a minimum of 1 and a maximum of 52 years, indicating that most homes fall within a mid-aged range.
  3. Total Rooms and Bedrooms: Total rooms range from 2 to 39,320, and total bedrooms from 1 to 6,445, highlighting a large variation in housing sizes. Median total rooms and bedrooms are 2,127 and 435, respectively, suggesting that most houses are moderately sized.
  4. Population and Households: Population per block group varies from as low as 3 to as high as 35,682, while households range from 1 to 6,082, showing significant demographic differences across regions.
  5. Median Income: Median income ranges widely, from $0.5k to $15k (in tens of thousands), with an average of $3.87k, indicating economic diversity across neighborhoods.
  6. Median House Value (Target Variable): House values span from $14,999 to $500,001, with a mean of $206,855. The maximum value is capped at 500,001, suggesting the data might have been truncated.
In [441]:
data.describe()
Out[441]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

4. Data Cleaning:¶


  1. Handling Missing Values: The missing values in the total_bedrooms column are addressed by imputing them with the median value of the column. This approach maintains the central tendency and mitigates the impact of outliers in the data.
In [442]:
# Drop rows with missing values
data.dropna(inplace=True)

# Drop duplicate rows
data.drop_duplicates(inplace=True)
  1. Detecting and Handling Outliers: To ensure the dataset's integrity, the Interquartile Range (IQR) method is applied to detect and handle potential outliers, focusing on the median_house_value column. This process helps in reducing the skewness caused by extreme values.
In [443]:
# Define a function to remove outliers based on the IQR method
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Filter out rows with values outside the lower and upper bounds
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Apply outlier removal for 'median_house_value'
data = remove_outliers(data, 'median_house_value').copy()
  1. Encoding the Categorical Variable: The categorical feature ocean_proximity is encoded using one-hot encoding. This method converts the categorical variable into a numeric format that can be used in further analysis and predictive modeling.
In [444]:
# Perform one-hot encoding for 'ocean_proximity'
data = pd.get_dummies(data, columns=['ocean_proximity'], drop_first=True)
  1. Verification of Data Cleaning: Verifying that the dataset no longer contains missing values and that the categorical variable has been appropriately encoded. This ensures data consistency before proceeding to the analysis phase.
In [445]:
# Check for any remaining missing values and confirm data structure
data.info()
data.head()
data.describe()
<class 'pandas.core.frame.DataFrame'>
Index: 19369 entries, 0 to 20639
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   longitude                   19369 non-null  float64
 1   latitude                    19369 non-null  float64
 2   housing_median_age          19369 non-null  float64
 3   total_rooms                 19369 non-null  float64
 4   total_bedrooms              19369 non-null  float64
 5   population                  19369 non-null  float64
 6   households                  19369 non-null  float64
 7   median_income               19369 non-null  float64
 8   median_house_value          19369 non-null  float64
 9   ocean_proximity_INLAND      19369 non-null  bool   
 10  ocean_proximity_ISLAND      19369 non-null  bool   
 11  ocean_proximity_NEAR BAY    19369 non-null  bool   
 12  ocean_proximity_NEAR OCEAN  19369 non-null  bool   
dtypes: bool(4), float64(9)
memory usage: 1.6 MB
Out[445]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 19369.000000 19369.000000 19369.000000 19369.000000 19369.000000 19369.000000 19369.000000 19369.000000 19369.000000
mean -119.563902 35.655784 28.344158 2620.710930 539.893335 1442.285043 501.303991 3.665475 190802.064949
std 2.005895 2.151468 12.503931 2187.046669 422.650225 1145.780125 383.339200 1.556776 95404.934086
min -124.350000 32.540000 1.000000 2.000000 2.000000 3.000000 2.000000 0.499900 14999.000000
25% -121.760000 33.930000 18.000000 1440.000000 297.000000 798.000000 282.000000 2.522300 116100.000000
50% -118.510000 34.270000 28.000000 2110.000000 437.000000 1181.000000 411.000000 3.442700 173200.000000
75% -117.990000 37.730000 37.000000 3119.000000 648.000000 1746.000000 606.000000 4.572400 246400.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 482200.000000

5. Exploratory Data Analysis:¶


Exploratory Data Analysis (EDA) provides insights into the California housing dataset by visualizing the distribution of house values and exploring relationships between features like median income, location, and ocean proximity. This helps identify key predictors for housing prices and prepares the data for modeling.

1. Distribution Analysis¶

The Distribution Analysis examines the spread and central tendencies of key features, such as median house value and median income, to understand their variability and highlight any skewness in the dataset. 1. Histogram of Median House Value 1. Distribution of Median Income

In [446]:
plt.figure(figsize=(12, 8))
# Plotting the histogram of 'median_house_value' with more bins for granularity
sns.histplot(data['median_house_value'], kde=True, color='skyblue', bins=60)

# Adding vertical lines to show mean and median values for better insight
print("Figure 5.1.1: Distribution of Median House Value in California")
plt.axvline(x=data['median_house_value'].mean(), color='red', linestyle='--', linewidth=2, label='Mean Value')
plt.axvline(x=data['median_house_value'].median(), color='green', linestyle='-', linewidth=2, label='Median Value')
plt.title('Distribution of Median House Value in California', fontsize=18)
plt.xlabel('Median House Value (in USD)', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.legend(fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
mean_value = data['median_house_value'].mean()
median_value = data['median_house_value'].median()
plt.text(mean_value + 5000, plt.ylim()[1] * 0.7, f'Mean: ${mean_value:.0f}', color='red', fontsize=12, rotation=90)
plt.text(median_value + 5000, plt.ylim()[1] * 0.5, f'Median: ${median_value:.0f}', color='green', fontsize=12, rotation=90)
plt.show()
Figure 5.1.1: Distribution of Median House Value in California
No description has been provided for this image
In [447]:
# Distribution of Median Income Visualization
plt.figure(figsize=(12, 8))
sns.histplot(data['median_income'], kde=True, color='skyblue', bins=50)

# Adding vertical lines for the mean and median income for better insight
print("Figure 5.1.2: Distribution of Median Income in California Housing Data")
plt.axvline(x=data['median_income'].mean(), color='red', linestyle='--', linewidth=2, label='Mean Income')
plt.axvline(x=data['median_income'].median(), color='blue', linestyle='-', linewidth=2, label='Median Income')
plt.title('Distribution of Median Income in California Housing Data', fontsize=18)
plt.xlabel('Median Income (in tens of thousands)', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.legend(fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Annotating the mean and median values
mean_income = data['median_income'].mean()
median_income = data['median_income'].median()
plt.text(mean_income + 0.1, plt.ylim()[1] * 0.7, f'Mean: ${mean_income:.2f}', color='red', fontsize=12, rotation=90)
plt.text(median_income + 0.1, plt.ylim()[1] * 0.5, f'Median: ${median_income:.2f}', color='blue', fontsize=12, rotation=90)
plt.show()
Figure 5.1.2: Distribution of Median Income in California Housing Data
No description has been provided for this image

2. Relationship Analysis¶

The Relationship Analysis explores how different features, such as median income and geographic location, correlate with median house value, providing insights into key factors that influence housing prices in the dataset. 1. Scatter Plot: Median Income vs. Median House Value 1. Scatter Plot: Latitude and Longitude vs. Median House Value

In [448]:
plt.figure(figsize=(12, 8))
sns.scatterplot(x='median_income', y='median_house_value', data=data, alpha=0.6,
                hue='median_house_value', size='median_income', palette='viridis', sizes=(20, 200))

# Adding a regression line for better insight into the trend
print("Figure 5.2.1: Relationship between Median Income and Median House Value")
sns.regplot(x='median_income', y='median_house_value', data=data, scatter=False, color='red', line_kws={'linestyle':'--'}, ci=None)
plt.title('Relationship between Median Income and Median House Value', fontsize=18)
plt.xlabel('Median Income (in tens of thousands)', fontsize=14)
plt.ylabel('Median House Value (in USD)', fontsize=14)
plt.legend(title='House Value', loc='upper left', fontsize=12, title_fontsize=14)
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Figure 5.2.1: Relationship between Median Income and Median House Value
No description has been provided for this image
In [449]:
# Create a figure with 1 row and 3 columns
fig, axs = plt.subplots(1, 3, figsize=(24, 8))

print("Figure 5.2.2: Relationship between Median Income and Median House Value by Ocean Proximity") 
# Plot 1: Median Income by Location
scatter1 = axs[0].scatter(data['longitude'], data['latitude'], c=data['median_income'], cmap='plasma', alpha=0.6, edgecolor='k', linewidth=0.5)
axs[0].set_xlabel('Longitude', fontsize=12)
axs[0].set_ylabel('Latitude', fontsize=12)
axs[0].set_title('Median Income by Location', fontsize=16)
fig.colorbar(scatter1, ax=axs[0], label='Income ($k)')

# Plot 2: Housing Median Age by Location
scatter2 = axs[1].scatter(data['longitude'], data['latitude'], c=data['housing_median_age'], cmap='cividis', alpha=0.8, edgecolor='k', linewidth=0.5)
axs[1].set_xlabel('Longitude', fontsize=12)
axs[1].set_ylabel('Latitude', fontsize=12)
axs[1].set_title('Housing Median Age by Location', fontsize=16)
fig.colorbar(scatter2, ax=axs[1], label='Housing Age (Years)')

# Plot 3: Median House Value by Location
scatter3 = axs[2].scatter(data['longitude'], data['latitude'], c=data['median_house_value'], cmap='magma', alpha=0.6, edgecolor='k', linewidth=0.5)
axs[2].set_xlabel('Longitude', fontsize=12)
axs[2].set_ylabel('Latitude', fontsize=12)
axs[2].set_title('Median House Value by Location', fontsize=16)
fig.colorbar(scatter3, ax=axs[2], label='House Value ($)')

plt.tight_layout(pad=3.0)
plt.subplots_adjust(top=0.92)
fig.suptitle('Geospatial Analysis of California Housing Data', fontsize=20, y=1.05)
plt.show()


# Creating a map centered around California
print("Figure 5.2.3: Heatmap of California Housing Data")
california_map = folium.Map(location=[36.7783,-119.4179], zoom_start = 6, min_zoom=4)
df_map = data[['latitude', 'longitude']]
heatmap_data = [[row['latitude'], row['longitude']] for index, row in df_map.iterrows()]
HeatMap(
    heatmap_data,
    radius=15,
    blur=10,
    max_zoom=1,
    min_opacity=0.3,
).add_to(california_map)
california_map
Figure 5.2.2: Relationship between Median Income and Median House Value by Ocean Proximity
No description has been provided for this image
Figure 5.2.3: Heatmap of California Housing Data
Out[449]:
Make this Notebook Trusted to load map: File -> Trust Notebook

3. Correlation Analysis¶

Correlation analysis identifies the strength of linear relationships between numerical features in the dataset, highlighting which variables are most closely associated with median house value and helping to detect potential multicollinearity.

In [450]:
print("Figure 5.3.1: Correlation Heatmap of California Housing Dataset Features")
plt.figure(figsize=(14, 10))
sns.heatmap(
    data.corr(), 
    annot=True, 
    cmap='RdBu_r', # Reversed Red-Blue colormap to highlight both positive and negative correlations distinctly
    linewidths=1, 
    linecolor='white',
    annot_kws={'size': 10},
    fmt=".2f", 
    cbar_kws={'shrink': 0.8, 'aspect': 30, 'label': 'Correlation Coefficient'}
)
plt.title('Correlation Heatmap of California Housing Dataset Features', fontsize=18, pad=20)
plt.xticks(fontsize=12, rotation=45, ha='right')
plt.yticks(fontsize=12, rotation=0)
plt.tight_layout()
plt.show()
Figure 5.3.1: Correlation Heatmap of California Housing Dataset Features
No description has been provided for this image

4. Box Plot Analysis¶

Box Plot Analysis provides a visual summary of how median house values vary across different categories, such as ocean proximity, helping to identify differences and patterns in housing prices across categorical groups in the dataset.

In [451]:
data['ocean_proximity'] = data[['ocean_proximity_INLAND', 'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY', 'ocean_proximity_NEAR OCEAN']].idxmax(axis=1).str.replace('ocean_proximity_', '')
print("Figure 5.4.1: Boxplot of Median House Value by Ocean Proximity")
sns.boxplot(x='ocean_proximity', y='median_house_value', data=data, hue='ocean_proximity', palette='Set3', width=0.6, fliersize=4, legend=False)
plt.title('Median House Value by Ocean Proximity', fontsize=20, pad=20)
plt.xlabel('Ocean Proximity', fontsize=15)
plt.ylabel('Median House Value (in USD)', fontsize=15)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Figure 5.4.1: Boxplot of Median House Value by Ocean Proximity
No description has been provided for this image

5. Pair Plot Analysis¶

The Pair Plot provides a visual overview of pairwise relationships between selected features, helping to identify trends, correlations, and potential interactions in the dataset.

In [452]:
selected_features = ['median_house_value', 'median_income', 'total_rooms', 'housing_median_age', 'population']
# Sampling the data for better readability and performance
sampled_data = data[selected_features].sample(5000, random_state=42)
print("Figure 5.5.1: Pair Plot of Selected Features from California Housing Data")
sns.pairplot(sampled_data, diag_kind='kde', kind='scatter', 
             plot_kws={'alpha': 0.5, 's': 15, 'edgecolor': 'k'}, diag_kws={'fill': True})
plt.suptitle('Pair Plot of Selected Features from California Housing Data', y=1.02, fontsize=18)
plt.show()
Figure 5.5.1: Pair Plot of Selected Features from California Housing Data
No description has been provided for this image

6. Exploratory Data Analysis Summary¶

  1. Distribution Analysis:
    • Median House Value: The distribution is right-skewed, with most houses priced below $300,000. The mean value is slightly higher than the median, showing the influence of higher-priced houses.
    • Median Income: The median income distribution is also right-skewed, with the majority of households earning between $2,000 and $5,000 (in tens of thousands), indicating some high-income households skewing the average.
  2. Relationship Analysis:
    • Median Income vs. Median House Value: The scatter plot shows a clear positive relationship between income and house value. Higher income tends to lead to higher house values, as demonstrated by a distinct upward trend in the plot.
    • Geospatial Relationship: House values and incomes are significantly higher near the coast, particularly in areas like Los Angeles and the Bay Area, emphasizing the importance of location in determining house prices.
  3. Correlation Heatmap:
    • The correlation heatmap shows that median income has a strong positive correlation with median house value (0.64). Other features such as total rooms, population, and households exhibit strong internal correlations but have limited predictive power for housing prices.
  4. Box Plot Analysis by Ocean Proximity:
    • The box plot of median house value by ocean proximity shows that properties ISLAND, NEAR BAY, and NEAR OCEAN have much higher median values compared to INLAND properties. The coastal proximity of properties significantly increases their value.
  5. Pair Plot Analysis:
    • The pair plot reveals a strong positive correlation between median income and median house value, whereas relationships between other features like total rooms and median house value are less distinct. The distributions on the diagonal further confirm the skewness seen in earlier analyses.
  6. Heatmap:
    • The heatmap provided an interactive visualization of house values across California. Hotspots of high median house values were observed along the coast, particularly near the Bay Area and Southern California. This visualization provided an intuitive understanding of how geography impacts housing prices, reinforcing the importance of location.

7. Exploratory Data Analysis Conclusion¶

  • Income is the strongest predictor of housing price, with clear positive correlations visible in both scatter plots and the correlation heatmap.
  • Geography has a significant impact on house prices, with properties in coastal areas being far more valuable compared to inland ones. The Folium heatmap and geographic scatter plots emphasize this geographic trend.
  • Ocean Proximity is a strong determinant of property prices. Homes closer to the coast or near bays and oceans are consistently higher in value than those located inland.
  • Distributions of Income and Housing Prices show that both are right-skewed, with the presence of high-income households and expensive properties increasing the average beyond the median.

6. Model Selection:¶


The goal of the Model Selection phase is to identify the best predictive model for estimating median house value based on features like median income, location, and other characteristics of California housing data. Model selection is a critical step in machine learning where compare the performance of different models to determine which one fits the data best while maintaining a good balance between bias and variance.

Multiple regression models will be tested to determine their predictive power:

  1. Linear Regression: This model helps to establish a linear relationship between features like median income and median house value. It is simple, interpretable, and a good baseline for understanding the relationships in the data.

  2. Ridge and Lasso Regression: These are forms of regularized linear regression that help control overfitting. Ridge Regression adds a penalty for the magnitude of coefficients to make the model more robust, while Lasso Regression performs feature selection by driving some feature coefficients to zero.

  3. Polynomial Regression: This model captures the non-linear relationship between median income and median house value, allowing us to model complex data patterns that may not fit a straight line.

  4. Random Forest Regression: This is an ensemble model that builds multiple decision trees and averages their predictions. Random Forest helps capture complex interactions between features and is useful for handling large datasets.

Model Selection Criteria:¶

Each model will be evaluated based on performance metrics such as:

  • R-Squared (R²): This metric measures the proportion of variance in the target variable that is explained by the features. A higher R² value indicates a better fit.
  • Mean Squared Error (MSE): MSE measures the average squared difference between actual and predicted values, providing an idea of how accurate the model is.
  • Mean Absolute Error (MAE): This metric captures the average difference between actual and predicted values, offering a direct interpretation of prediction accuracy.

By comparing these metrics across different models, the aim is to find a balance between underfitting and overfitting, thereby selecting the model that best generalizes to unseen data.

Planned Models for Selection:¶

  1. Linear Regression (Median House Value using Median Income)
  2. Linear Regression (Using Longitude and Latitude to predict house prices based on location)
  3. Multiple Linear Regression (Using Median Income, Total Rooms, Population)
  4. Ridge Regression (Using Median Income)
  5. Lasso Regression (Using Median Income)
  6. Polynomial Regression (Using Median Income)
  7. Random Forest Regression (Using Median Income)

Steps for Model Selection:¶

  1. Data Preparation: Splitting the data into training and testing sets for model evaluation.
  2. Model Implementation: Implementing each regression model, starting with Linear Regression.
  3. Model Evaluation: Evaluating models using metrics like R-Squared, MSE, and MAE.

7. Model Analysis:¶


In the Model Analysis phase, each of the trained models will be assessed based on their predictive performance, error metrics, and ability to generalize to unseen data. This involves evaluating the metrics of R-Squared (R²), Mean Squared Error (MSE), and Mean Absolute Error (MAE). Additionally, model behavior, residual analysis, and robustness checks will be considered to determine the best-performing model among those used.

Model Analysis Criteria:¶

  • Performance Metrics:
    • R-Squared (R²): Represents the proportion of variance explained by the model. Higher R² indicates better predictive power.
    • Mean Squared Error (MSE): Measures the average squared error between predictions and actual values, indicating how well the model is performing.
    • Mean Absolute Error (MAE): Captures the average magnitude of prediction errors without considering their direction, providing an overall error value in the original units.
  • Model Complexity:
    • Assessing whether adding complexity improves the model's performance significantly or if it leads to overfitting.
    • Ridge and Lasso Regression are regularized models that help in mitigating overfitting and feature importance, while Polynomial Regression and Random Forest Regression account for potential non-linear relationships.
  • Residual Analysis:
    • Analyzing residuals for patterns that may indicate model biases or misspecifications.
    • Homogeneity of Variance: Residuals should have constant variance.
    • Normality: Residuals should be normally distributed.

Sampling:¶

Sampling in machine learning involves splitting data into a training set for model learning and a testing set for evaluation. This approach helps assess the model's performance on unseen data, providing an estimate of its accuracy and generalization ability.

  1. Train-Test Split: The dataset is divided into training and testing sets, typically with an 80-20 ratio, to allow the model to learn from a subset (training data) and evaluate its performance on unseen data (test data). This approach helps assess the model's generalization ability.
  2. Random Sampling: A random seed (like random_state=42) ensures that the sampling is reproducible, meaning the same train-test split can be achieved each time, which is useful for consistent results and comparisons across different model runs.
In [453]:
sample_test_size = 0.8
sample_random_state = 42

# Selecting the feature and target for the first model: Predicting Median House Value using Median Income
X = data[['median_income']]
y = data['median_house_value'] #target variable

1. Linear Regression (Median House Value using Median Income)¶

This model predicts the median house value based on median income, establishing a simple linear relationship to understand how income influences housing prices.

$$\text{Median House Value} = \beta_0 + \beta_1 \cdot \text{Median Income} + \epsilon$$

where:

  • $\beta_0$ is the intercept,
  • $\beta_1$ is the coefficient for Median Income,
  • $\epsilon$ represents the residuals (errors).
In [454]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=sample_test_size, random_state=sample_random_state)

# Creating and training the Linear Regression model
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

# Making predictions on the testing set
y_pred = linear_reg.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

# Displaying the evaluation metrics
print("Mean Squared Error (MSE):", mse)
print("R-Squared (R²):", r2)
print("Mean Absolute Error (MAE):", mae)

# Residual Analysis
# Calculate residuals
residuals = y_test - y_pred

# Residual Distribution Plot
print("Figure 7.1.1: Residual Distribution for Linear Regression (Median Income vs. Median House Value)")
plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True, color='skyblue', bins=30)

# Adding vertical lines for mean and median
plt.axvline(residuals.mean(), color='red', linestyle='--', linewidth=2, label='Mean')
plt.axvline(residuals.median(), color='orange', linestyle='-', linewidth=2, label='Median')

plt.title('Residual Distribution for Linear Regression (Median Income vs. Median House Value)', fontsize=16)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
plt.figure(figsize=(10, 6))

sc = plt.scatter(y_test, y_pred, alpha=0.6, c=y_pred, cmap='coolwarm', edgecolor='k', s=40)

# Line of perfect prediction
print("Figure 7.1.2: Predicted vs. Actual Median House Value for Linear Regression (Median Income vs. Median House Value)")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2, label='Perfect Prediction')
plt.xlabel('Actual Median House Value', fontsize=12)
plt.ylabel('Predicted Median House Value', fontsize=12)
plt.title('Predicted vs. Actual Median House Value', fontsize=16)
plt.colorbar(sc, label='Predicted House Value')
plt.grid(True)
plt.legend()
plt.show()
Mean Squared Error (MSE): 5310124648.141448
R-Squared (R²): 0.4153178441063634
Mean Absolute Error (MAE): 55706.32803939273
Figure 7.1.1: Residual Distribution for Linear Regression (Median Income vs. Median House Value)
No description has been provided for this image
Figure 7.1.2: Predicted vs. Actual Median House Value for Linear Regression (Median Income vs. Median House Value)
No description has been provided for this image

Interpretation: The Linear Regression model, using median income to predict median house value, yields the following insights:

  • $MSE$ of approximately 5.31 billion indicates large prediction errors, particularly for high-value homes.
  • $R^2$ of 0.415 shows that median income explains about 43.4% of the variance in median house values, suggesting that additional features are needed to improve predictive power.
  • $MAE$ of $55,706 reflects significant average prediction errors, indicating that the model’s accuracy may be insufficient for precise pricing predictions.

Chart Analysis:

  • Residual Distribution: Residuals are centered around zero but show a slight skew, with larger errors in high-value predictions.
  • Predicted vs. Actual Scatter: While there is a positive trend, deviations from the perfect prediction line, especially at high values, highlight limitations in using a single feature.

2. Linear Regression (Using Longitude and Latitude to predict house prices based on location)¶

This model predicts median house value based on geographical coordinates, aiming to capture the impact of location on housing prices.

$$\text{Median House Value} = \beta_0 + \beta_1 \cdot \text{Longitude} + \beta_2 \cdot \text{Latitude} + \epsilon$$

where:

  • $\beta_0$ is the intercept,
  • $\beta_1$ and $\beta_2$ are the coefficients for Longitude and Latitude,
  • $\epsilon$ represents the residuals (errors).
In [455]:
# Selecting features (longitude, latitude) and target (median house value)
X_loc = data[['longitude', 'latitude']]

# Splitting the dataset into training and testing sets
X_train_loc, X_test_loc, y_train_loc, y_test_loc = train_test_split(X_loc, y, test_size=sample_test_size, random_state=sample_random_state)

# Creating and training the Linear Regression model
linear_reg_loc = LinearRegression()
linear_reg_loc.fit(X_train_loc, y_train_loc)

# Making predictions on the testing set
y_pred_loc = linear_reg_loc.predict(X_test_loc)

# Evaluating the model
mse_loc = mean_squared_error(y_test_loc, y_pred_loc)
r2_loc = r2_score(y_test_loc, y_pred_loc)
mae_loc = mean_absolute_error(y_test_loc, y_pred_loc)

# Displaying the evaluation metrics
print("Mean Squared Error (MSE):", mse_loc)
print("R-Squared (R²):", r2_loc)
print("Mean Absolute Error (MAE):", mae_loc)

# Residual Analysis - Calculate residuals
residuals_loc = y_test_loc - y_pred_loc

# Residual Distribution Plot
plt.figure(figsize=(10, 6))
sns.histplot(residuals_loc, kde=True, color='skyblue', bins=30)

# Adding vertical lines for mean and median
plt.axvline(residuals_loc.mean(), color='red', linestyle='--', linewidth=2, label='Mean')
plt.axvline(residuals_loc.median(), color='orange', linestyle='-', linewidth=2, label='Median')

# Adding title, labels, legend, and grid
print("Figure 7.2.1: Residual Distribution for Linear Regression (Longitude & Latitude vs. Median House Value)")
plt.title('Residual Distribution for Linear Regression (Longitude & Latitude vs. Median House Value)', fontsize=16)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Predicted vs. Actual Scatter Plot
plt.figure(figsize=(10, 6))

# Scatter plot with color gradient
sc = plt.scatter(y_test_loc, y_pred_loc, alpha=0.6, c=y_pred_loc, cmap='coolwarm', s=40)

# Line of perfect prediction
plt.plot([y_test_loc.min(), y_test_loc.max()], [y_test_loc.min(), y_test_loc.max()], 'r--', linewidth=2, label='Perfect Prediction')

# Adding title, labels, color bar, and grid
print("Figure 7.2.2: Predicted vs. Actual Median House Value for Linear Regression (Longitude & Latitude vs. Median House Value)")
plt.xlabel('Actual Median House Value', fontsize=12)
plt.ylabel('Predicted Median House Value', fontsize=12)
plt.title('Predicted vs. Actual Median House Value (Longitude & Latitude)', fontsize=16)
plt.colorbar(sc, label='Predicted House Value')
plt.grid(True)
plt.legend()
plt.show()

# Folium Heatmap for Predicted Median House Values
# Prepare data for heatmap
print("Figure 7.2.3: Heatmap of Predicted Median House Values in California")
heat_data = [[row['latitude'], row['longitude'], prediction] for row, prediction in zip(X_test_loc.to_dict('records'), y_pred_loc)]

# Create base map
california_map = folium.Map(location=[36.7783, -119.4179], zoom_start=6, min_zoom=4)

# Add HeatMap layer
HeatMap(heat_data, radius=8, blur=15, max_zoom=13).add_to(california_map)

# Display the map
california_map
Mean Squared Error (MSE): 6766043305.090507
R-Squared (R²): 0.2550109369137644
Mean Absolute Error (MAE): 64215.74292257116
Figure 7.2.1: Residual Distribution for Linear Regression (Longitude & Latitude vs. Median House Value)
No description has been provided for this image
Figure 7.2.2: Predicted vs. Actual Median House Value for Linear Regression (Longitude & Latitude vs. Median House Value)
No description has been provided for this image
Figure 7.2.3: Heatmap of Predicted Median House Values in California
Out[455]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Interpretation: The Linear Regression model, using longitude and latitude to predict median house value, yields the following insights:

  • $MSE$ of approximately 6.76 billion indicates large prediction errors, particularly for homes in high-value locations, as location alone may not fully capture price determinants.
  • $R^2$ of 0.255 shows that longitude and latitude explain only about 25.4% of the variance in median house values, suggesting that additional features are essential for more accurate predictions.
  • $MAE$ of $64,215 reflects considerable average prediction errors, indicating limited precision in predicting housing prices based solely on location.

Chart Analysis:

  • Residual Distribution Plot:

    • The residuals are centered around zero, indicating no strong overall bias. However, the distribution is broad, with substantial errors on both the positive and negative ends, reflecting inconsistent prediction accuracy.
    • The slight skew in the distribution and the long tails indicate that the model struggles with outliers, particularly with homes in higher or lower price ranges.
    • The mean and median residuals being close to zero confirm that the model does not have systematic bias but lacks precision due to missing influential predictors.
  • Predicted vs. Actual Scatter Plot:

    • The scatter plot shows a weak positive trend, with points scattered widely around the line of perfect prediction. The large dispersion indicates that the model has substantial difficulty in accurately predicting housing prices.
    • The color gradient (from blue to red) represents predicted values, showing that higher predicted prices often deviate significantly from actual values, further highlighting the limitations of location-only predictors.
    • This scatter pattern suggests that latitude and longitude alone do not fully capture the variability in housing prices, as homes in similar geographic areas may still have vastly different prices due to factors not included in this model, such as house size, income levels, or neighborhood characteristics.
  • Heatmap:

    • The heatmap shows areas of higher predicted house prices, with concentrations around major cities like Los Angeles, San Francisco, and San Diego. These areas show intense heat, indicating the model's higher predicted values in urban centers.
    • While the heatmap provides a useful geographical representation, it cannot account for finer details that might affect house prices within these areas, such as neighborhood quality or property features, which further limits the accuracy of predictions based solely on location.

3. Multiple Linear Regression (Using Median Income, Total Rooms, Population)¶

This model predicts the median house value based on median income, total rooms, and population, establishing a linear relationship to understand how these features collectively influence housing prices. $$ \text{Median House Value} = \beta_0 + \beta_1 \cdot \text{Median Income} + \beta_2 \cdot \text{Total Rooms} + \beta_3 \cdot \text{Population} + \epsilon $$

where:

  • $\beta_0$ is the intercept,
  • $\beta_1$, $\beta_2$, and $\beta_3$ are the coefficients for Median Income, Total Rooms, and Population,
  • $\epsilon$ represents the residuals.
In [456]:
# define our predictor variables
X_multi = data[['median_income', 'total_rooms', 'population']]

# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_multi, y, test_size=sample_test_size, random_state=sample_random_state)

# train the multiple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# make predictions on the test data
y_pred = model.predict(X_test)

# calculate the R-squared score and mean squared error
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")

# plotting our results
print("Figure 7.3.1: Residual Distribution for Multiple Linear Regression")
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

# scatter plot for the actual values
ax.scatter(X_test['median_income'], X_test['total_rooms'], y_test, color='blue', label='Actual Values', alpha=0.6)

# scatter plot of the predicted values
ax.scatter(X_test['median_income'], X_test['total_rooms'], y_pred, color='red', label='Predicted Values', alpha=0.6)

# set out plot labels
ax.set_xlabel('Median Income')
ax.set_ylabel('Total Rooms')
ax.set_zlabel('Median House Value')
ax.set_title('Multiple Linear Regression: Actual vs Predicted House Prices')
ax.legend()

plt.show()
Mean Squared Error: 5317735338.447184
R-squared Score: 0.41447985345442007
Figure 7.3.1: Residual Distribution for Multiple Linear Regression
No description has been provided for this image

Interpretation:

Chart Analysis:

4. Ridge Regression (Using Median Income)¶

This model predicts the median house value based on median income, using Ridge Regression to apply regularization. This approach penalizes large coefficients, reducing overfitting and improving the model's generalization by controlling the impact of outliers or noise in the data. $$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 \cdot \text{Median Income}_i) \right)^2 + \lambda \cdot \beta_1^2$$

where:

  • $y_i$ is the actual Median House Value for observation $i$,
  • $\lambda$ is the regularization parameter controlling the penalty on the coefficient $\beta_1$.
In [457]:
# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=sample_test_size, random_state=sample_random_state)

# train the Ridge regression model
ridge_model = Ridge(alpha=1.0) 
ridge_model.fit(X_train, y_train)

# make predictions on the test data
y_pred = ridge_model.predict(X_test)

# calculate alculate Mean Squared Error and R-squared Score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")

# plotting ridge regrssion
print("Figure 7.4.1: Ridge Regression Line for Median Income vs Median House Value")
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', label='Actual Values', alpha=0.6)
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Ridge Regression Line')
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.title('Ridge Regression: Median Income vs Median House Value')
plt.legend()
plt.show()
Mean Squared Error: 5310126206.77687
R-squared Score: 0.41531767248961626
Figure 7.4.1: Ridge Regression Line for Median Income vs Median House Value
No description has been provided for this image

Interpretation:

Chart Analysis:

5. Lasso Regression (Using Median Income)¶

This model predicts the median house value based on median income, using Lasso Regression to apply regularization. Lasso Regression not only reduces overfitting by penalizing large coefficients but also performs feature selection by driving some coefficients to zero, thus simplifying the model and enhancing interpretability. $$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 \cdot \text{Median Income}_i) \right)^2 + \lambda \cdot |\beta_1|$$

where:

  • $\lambda$ is the regularization parameter controlling the penalty on the coefficient $\beta_1$,
  • $|\beta_1|$ promotes sparsity in $\beta_1$.
In [458]:
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=sample_test_size, random_state=sample_random_state)

# train the Lasso regression model
lasso_model = Lasso(alpha=1.0)  
lasso_model.fit(X_train, y_train)

# make predictions on the test data
y_pred = lasso_model.predict(X_test)

# calculate Mean Squared Error and R-squared Score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")

# plot the Lasso regression results
print("Figure 7.5.1: Lasso Regression Line for Median Income vs Median House Value")
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', label='Actual Values', alpha=0.6)
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Lasso Regression Line')
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.title('Lasso Regression: Median Income vs Median House Value')
plt.legend()
plt.show()
Mean Squared Error: 5310124797.548696
R-squared Score: 0.4153178276555721
Figure 7.5.1: Lasso Regression Line for Median Income vs Median House Value
No description has been provided for this image

Interpretation:

Chart Analysis:

6. Polynomial Regression (Using Median Income)¶

This model predicts the median house value based on median income by fitting a polynomial relationship, allowing for the capture of non-linear patterns in the data. By transforming the linear model into a polynomial one, this approach better models the complex relationship between income and housing prices, potentially improving prediction accuracy. $$\text{Median House Value} = \beta_0 + \beta_1 \cdot \text{Median Income} + \beta_2 \cdot \text{Median Income}^2 + \epsilon$$

where:

  • $\beta_2$ is the coefficient for the squared term of Median Income,
  • Higher-degree terms can be added as needed.
In [459]:
# Transforming the feature into polynomial features
poly_transformer = PolynomialFeatures(degree=2)  # Degree can be adjusted
X_poly_transformed = poly_transformer.fit_transform(X)

# Splitting the dataset into training and testing sets
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(X_poly_transformed, y, test_size=sample_test_size, random_state=sample_random_state)

# Creating and training the Polynomial Regression model
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train_poly)

# Making predictions on the testing set
y_pred_poly = poly_reg.predict(X_test_poly)

# Evaluating the model
mse_poly = mean_squared_error(y_test_poly, y_pred_poly)
r2_poly = r2_score(y_test_poly, y_pred_poly)
mae_poly = mean_absolute_error(y_test_poly, y_pred_poly)

# Displaying the evaluation metrics
print("Mean Squared Error (MSE):", mse_poly)
print("R-Squared (R²):", r2_poly)
print("Mean Absolute Error (MAE):", mae_poly)

# Visual Analysis
# Plotting Predicted vs. Actual Values
print("Figure 7.6.1: Predicted vs. Actual Median House Value for Polynomial Regression (Median Income)")
plt.figure(figsize=(10, 6))
sc = plt.scatter(y_test_poly, y_pred_poly, alpha=0.6, c=y_pred_loc, cmap='coolwarm', s=40)
plt.plot([y_test_poly.min(), y_test_poly.max()], [y_test_poly.min(), y_test_poly.max()], 'r--', label='Perfect Prediction')
plt.xlabel('Actual Median House Value')
plt.ylabel('Predicted Median House Value')
plt.title('Predicted vs. Actual Median House Value (Polynomial Regression)')
plt.legend()
plt.grid(True)
plt.show()

# Residual Analysis
residuals_poly = y_test_poly - y_pred_poly
print("Figure 7.6.2: Residual Distribution for Polynomial Regression (Median Income)")
plt.figure(figsize=(10, 6))
sns.histplot(residuals_poly, kde=True, color='skyblue', bins=30)
plt.axvline(residuals_poly.mean(), color='red', linestyle='--', label='Mean')
plt.axvline(residuals_poly.median(), color='orange', label='Median')
plt.title('Residual Distribution for Polynomial Regression (Median Income)')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Mean Squared Error (MSE): 5310271000.495122
R-Squared (R²): 0.4153017296805568
Mean Absolute Error (MAE): 55801.45846647627
Figure 7.6.1: Predicted vs. Actual Median House Value for Polynomial Regression (Median Income)
No description has been provided for this image
Figure 7.6.2: Residual Distribution for Polynomial Regression (Median Income)
No description has been provided for this image

Interpretation: The Linear Regression model, using longitude and latitude to predict median house value, yields the following insights:

  • $MSE$ of 5.31 billion – Indicates significant prediction errors, especially at higher house values.
  • $R^2$ of 0.415 explains 41.5% of price variance, suggesting moderate accuracy.
  • $MAE$ of $55,801 explains average deviation shows moderate prediction accuracy.

Chart Analysis:

  • Residual Distribution Plot:
    • Residuals are centered around zero, showing no strong bias. Slight skew and long tails indicate errors, particularly for outliers.
    • Mean and median close to zero confirm balanced prediction errors.
  • Predicted vs. Actual Scatter Plot:
    • Positive trend but notable deviations from the perfect prediction line, especially for high values, show limitations.
    • Dense clustering at lower values indicates better accuracy for mid-range prices, with increased errors at higher values.

7. Random Forest Regression (Using Median Income)¶

This model predicts the median house value based on median income using Random Forest Regression, an ensemble method that builds multiple decision trees and averages their predictions. By capturing complex, non-linear relationships and reducing overfitting through averaging, Random Forest improves prediction accuracy and robustness compared to single linear models. $$\text{Median House Value} = \sum_{t=1}^{T} f_t(\text{Median Income})$$ where:

  • $T$ is the total number of trees in the forest,
  • $f_t(\text{Median Income})$ is the prediction from the $t$-th tree.
In [460]:
# Splitting the dataset into training and testing sets
X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(X, y, test_size=sample_test_size, random_state=sample_random_state)

# Creating and training the Random Forest Regressor model
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=sample_random_state)
rf_regressor.fit(X_train_rf, y_train_rf)

# Making predictions on the testing set
y_pred_rf = rf_regressor.predict(X_test_rf)

# Evaluating the model
mse_rf = mean_squared_error(y_test_rf, y_pred_rf)
r2_rf = r2_score(y_test_rf, y_pred_rf)
mae_rf = mean_absolute_error(y_test_rf, y_pred_rf)

print("Random Forest Model Evaluation Metrics:")
print("Mean Squared Error (MSE):", mse_rf)
print("R-Squared (R²):", r2_rf)
print("Mean Absolute Error (MAE):", mae_rf)

# Feature Importance (with a single feature, this will show how important median income is in the model)
importances = rf_regressor.feature_importances_
print("Feature Importance (Median Income):", importances[0])

# Visualize Predicted vs Actual Values
print("Figure 7.7.1: Predicted vs. Actual Median House Value for Random Forest Regression")
plt.figure(figsize=(10, 6))
plt.scatter(y_test_rf, y_pred_rf, alpha=0.6, c=y_pred_loc, cmap='coolwarm', s=40)
plt.plot([y_test_rf.min(), y_test_rf.max()], [y_test_rf.min(), y_test_rf.max()], 'r--', label='Perfect Prediction')
plt.xlabel('Actual Median House Value')
plt.ylabel('Predicted Median House Value')
plt.title('Predicted vs. Actual Median House Value (Random Forest Regression)')
plt.legend()
plt.grid(True)
plt.show()
Random Forest Model Evaluation Metrics:
Mean Squared Error (MSE): 7602350029.969873
R-Squared (R²): 0.16292767121077267
Mean Absolute Error (MAE): 66591.09679267
Feature Importance (Median Income): 1.0
Figure 7.7.1: Predicted vs. Actual Median House Value for Random Forest Regression
No description has been provided for this image

Interpretation: The Linear Regression model, using longitude and latitude to predict median house value, yields the following insights:

  • $MSE$ of 7.60 billion – This high MSE indicates substantial prediction errors, suggesting that median income alone may not be sufficient for accurately predicting house prices, even with a non-linear model like Random Forest.
  • $R^2$ of 0.162 – The model explains only 16.2% of the variance in median house values, indicating limited predictive power. Additional features would likely improve model performance significantly.
  • $MAE$ of $65,591 – On average, the model’s predictions deviate from actual values by around $65,734, reflecting moderate to high prediction errors.
  • Feature Importance (Median Income): 1.0 – Since median income is the sole feature, it has full importance in the model, though it clearly does not capture all the necessary information to accurately predict housing prices.

Chart Analysis:

  • Predicted vs. Actual Scatter Plot:
    • Positive Correlation: There is a positive correlation between actual and predicted values, suggesting that median income does have some impact on median house value. However, this relationship is weak, as reflected by the high dispersion around the perfect prediction line.
    • Deviation from Perfect Prediction: Many points deviate significantly from the red dashed line (perfect prediction line), indicating that the model struggles to capture the variability in house prices accurately. The spread of points around the line highlights the model’s limited accuracy with a single feature.
    • Density of Points: The points are widely scattered, particularly for higher house values, indicating that the model’s performance worsens with increasing house prices. This suggests that income alone is insufficient to predict prices accurately, as other influential factors are missing.

8. Comparative Analysis:¶


It provides a brief evaluation of multiple regression models, assessing their effectiveness in predicting California housing prices. By analyzing metrics such as MSE, R-squared, and MAE, this section highlights each model's performance, strengths, and limitations, ultimately guiding the selection of the most suitable model for robust predictions.

In [461]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=sample_test_size, random_state=sample_random_state)

# Dictionary to store evaluation metrics
results = {
    'Model': [],
    'MSE': [],
    'R²': [],
    'MAE': []
}

# Helper function to evaluate and store results
def evaluate_model(model, model_name):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    results['Model'].append(model_name)
    results['MSE'].append(mse)
    results['R²'].append(r2)
    results['MAE'].append(mae)

# 1. Linear Regression
evaluate_model(LinearRegression(), "Linear Regression")

# 2. Ridge Regression
evaluate_model(Ridge(alpha=1.0), "Ridge Regression")

# 3. Lasso Regression
evaluate_model(Lasso(alpha=1.0), "Lasso Regression")

# 4. Polynomial Regression (degree 2)
poly_transformer = PolynomialFeatures(degree=2)
X_poly_train = poly_transformer.fit_transform(X_train)
X_poly_test = poly_transformer.transform(X_test)
poly_model = LinearRegression()
poly_model.fit(X_poly_train, y_train)
y_pred_poly = poly_model.predict(X_poly_test)
mse_poly = mean_squared_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)
mae_poly = mean_absolute_error(y_test, y_pred_poly)
results['Model'].append("Polynomial Regression")
results['MSE'].append(mse_poly)
results['R²'].append(r2_poly)
results['MAE'].append(mae_poly)

# 5. Random Forest Regression
evaluate_model(RandomForestRegressor(n_estimators=100, random_state=42), "Random Forest Regression")

# Convert results dictionary to DataFrame
results_df = pd.DataFrame(results)

plt.figure(figsize=(14, 12))

print("Figure 8.1: Model Comparison for California Housing Data")

# Plot for Mean Squared Error (MSE)
plt.subplot(3, 1, 1)
plt.plot(results_df['Model'], results_df['MSE'], marker='o', linestyle='-', color='skyblue', linewidth=2, markersize=8)
plt.title("Mean Squared Error (MSE)", fontsize=14)
plt.xlabel("Model", fontsize=12)
plt.ylabel("MSE", fontsize=12)
for i, value in enumerate(results_df['MSE']):
    plt.text(i, value, f"{value:.2e}", ha='center', va='bottom', fontsize=10, color='black')
plt.grid(True, linestyle='--', alpha=0.7)

# Plot for R-Squared (R²)
plt.subplot(3, 1, 2)
plt.plot(results_df['Model'], results_df['R²'], marker='o', linestyle='-', color='salmon', linewidth=2, markersize=8)
plt.title("R-Squared (R²)", fontsize=14)
plt.xlabel("Model", fontsize=12)
plt.ylabel("R²", fontsize=12)
for i, value in enumerate(results_df['R²']):
    plt.text(i, value, f"{value:.2f}", ha='center', va='bottom', fontsize=10, color='black')
plt.grid(True, linestyle='--', alpha=0.7)

# Plot for Mean Absolute Error (MAE)
plt.subplot(3, 1, 3)
plt.plot(results_df['Model'], results_df['MAE'], marker='o', linestyle='-', color='lightgreen', linewidth=2, markersize=8)
plt.title("Mean Absolute Error (MAE)", fontsize=14)
plt.xlabel("Model", fontsize=12)
plt.ylabel("MAE", fontsize=12)
for i, value in enumerate(results_df['MAE']):
    plt.text(i, value, f"{value:.2f}", ha='center', va='bottom', fontsize=10, color='black')
plt.grid(True, linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

# Displaying the results table for reference
print(results_df)
Figure 8.1: Model Comparison for California Housing Data
No description has been provided for this image
                      Model           MSE        R²           MAE
0         Linear Regression  5.310125e+09  0.415318  55706.328039
1          Ridge Regression  5.310126e+09  0.415318  55706.694168
2          Lasso Regression  5.310125e+09  0.415318  55706.364004
3     Polynomial Regression  5.310271e+09  0.415302  55801.458466
4  Random Forest Regression  7.602350e+09  0.162928  66591.096793

ROC (Receiver Operating Characteristic) curves offer a graphical method to assess the performance of different models by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. The Area Under the Curve (AUC) value associated with each ROC curve quantifies the model's ability to distinguish between classes, where a higher AUC signifies better classification accuracy.

In [462]:
# Convert the problem to binary classification based on the median house value
median_value = data['median_house_value'].median()
y_binary = (data['median_house_value'] >= median_value).astype(int)  # 1 for high-value houses, 0 for low-value

# Prepare the predictor variable
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=sample_test_size, random_state=sample_random_state)

# Define models
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=1.0),
    "Polynomial Regression": PolynomialFeatures(degree=2),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=sample_random_state)
}

plt.figure(figsize=(10, 8))

# Loop through each model, fit, predict, and calculate ROC curve
for model_name, model in models.items():
    if model_name == "Polynomial Regression":
        # Transform the features for polynomial regression
        poly_transformer = PolynomialFeatures(degree=2)
        X_train_poly = poly_transformer.fit_transform(X_train)
        X_test_poly = poly_transformer.transform(X_test)
        poly_model = LinearRegression()
        poly_model.fit(X_train_poly, y_train)
        y_prob = poly_model.predict(X_test_poly)
    else:
        # Train other models
        model.fit(X_train, y_train)
        y_prob = model.predict(X_test)
    
    # Calculate ROC curve and AUC for the model
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    
    # Plot the ROC curve
    plt.plot(fpr, tpr, lw=2, label=f'{model_name} (AUC = {roc_auc:.2f})')

# Plot the diagonal line for random guessing
print("Figure 8.2: ROC Curve Comparison for Regression Models")
plt.plot([0, 1], [0, 1], 'k--', lw=2, label='Random Guessing (AUC = 0.50)')

# Adding plot details
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate", fontsize=12)
plt.ylabel("True Positive Rate", fontsize=12)
plt.title("ROC Curve Comparison for Regression Models", fontsize=14)
plt.legend(loc="lower right", fontsize=10)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
Figure 8.2: ROC Curve Comparison for Regression Models
No description has been provided for this image

9. Hypothesis Validation and testing:¶


10. Conclusion:¶


11. References:¶


  • Cam Nugent. (2023). California Housing Prices [Data set]. Kaggle. https://www.kaggle.com/datasets/camnugent/california-housing-prices/data
  • Torgo, L. (1997). California Housing Prices. Retrieved from https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html